Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[AIRFLOW-5768] GCP cloud sql don't store ephemeral connection in db #6440

Merged

Conversation

dstandish
Copy link
Contributor

Jira

Description

  • Here are some details about my PR, including screenshots of any UI changes:

GCP cloud sql operator creates dynamically an ephemeral Connection object. It persists to metastore during execution and deletes afterward.

This behavior has negative impact on our ability to refactor creds management.

By not persisting to database, we can also remove some complexity re ensuring connection is deleted in event of failure, and the tests that go along with that.

This change requires that we add optional param connection to both MySqlHook and PostgresHook.

Incidentally, this PR incorporates test simplifications originally in unmerged PR #6390 and will render that PR obsolete.

Tests

  • My PR adds the following unit tests OR does not need testing for this extremely good reason:

Commits

  • My commits all reference Jira issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Documentation

  • In case of new functionality, my PR adds documentation that describes how to use it.
    • All the public functions and the classes in the PR contain docstrings that explain what it does
    • If you implement backwards incompatible changes, please leave a note in the Updating.md so we can assign it to a appropriate release

@mik-laj mik-laj added the provider:google Google (including GCP) related issues label Oct 27, 2019
@ashb
Copy link
Member

ashb commented Oct 27, 2019

Something like this gets my vote. WDYT @mik-laj? (It might be worth moving some of that logic down in to the base DbApiHook? Not sure.)

@dstandish
Copy link
Contributor Author

dstandish commented Oct 28, 2019

It might be worth moving some of that logic down in to the base DbApiHook? Not sure.

Yeah I thought about this too and was also unsure. Just "connection" is a term that has great potential for conflict / confusion / ambiguity, to add it to all hooks could be trouble. Maybe if we renamed Connection to AirflowConnection... Or perhaps better, Creds or Credential. Connection is not really a connection. Come to think of it, this causes a lot of confusing code (all the get_conn in the codebase).

An alternative approach would be to alter the hook to accept every param the mysql connections (actual connection) would need, so that the GCP Cloud SQL hook does not actually need to create a connection object -- it can just pass the appropriate params to postgres / mysql hook.

On this latter idea, some have discouraged adding params like password as init params for hook on the basis it would allow users to do bad things like put creds in the dag file under source control. I personally don't feel like we should be nanny state in that way. Constrain as we might, a novice user might still put pyodbc.connect in their dags, for example. So I don't see this as a reason not to add params.

@dstandish dstandish force-pushed the feature/gcp-do-not-store-ephemeral-conn-in-db branch from 88668ba to a521089 Compare October 28, 2019 02:07
@dstandish
Copy link
Contributor Author

dstandish commented Oct 28, 2019

OK Just want to offer one more idea.

If we don't like adding connection to mysql hook / postgres hook, we can make simple subclasses in airflow/gcp/hooks/cloud_sql.py. It could even by defined inside CloudSqlDatabaseHook.

So we define something like this:

class _GCPMysqlHook(MySqlHook):
    def __init__(self, connection: Connection, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.connection = connection

    def get_connection(self, conn_id: str) -> Connection:
        return self.connection

And then in get_database_hook it's something like this:

self.db_hook = self._GCPMysqlHook(connection=connection, schema=self.database)

@codecov-io
Copy link

codecov-io commented Oct 28, 2019

Codecov Report

Merging #6440 into master will decrease coverage by 0.35%.
The diff coverage is 71.42%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #6440      +/-   ##
==========================================
- Coverage   83.99%   83.63%   -0.36%     
==========================================
  Files         627      627              
  Lines       36537    36516      -21     
==========================================
- Hits        30690    30541     -149     
- Misses       5847     5975     +128
Impacted Files Coverage Δ
airflow/hooks/mysql_hook.py 92.85% <100%> (+0.1%) ⬆️
airflow/hooks/postgres_hook.py 96.55% <100%> (+0.06%) ⬆️
airflow/gcp/operators/cloud_sql.py 84.61% <40%> (-4.37%) ⬇️
airflow/gcp/hooks/cloud_sql.py 68.42% <80%> (-1.82%) ⬇️
airflow/kubernetes/volume_mount.py 44.44% <0%> (-55.56%) ⬇️
airflow/kubernetes/volume.py 52.94% <0%> (-47.06%) ⬇️
airflow/kubernetes/pod_launcher.py 45.25% <0%> (-46.72%) ⬇️
airflow/kubernetes/kube_client.py 33.33% <0%> (-41.67%) ⬇️
...rflow/contrib/operators/kubernetes_pod_operator.py 70.14% <0%> (-28.36%) ⬇️
airflow/jobs/local_task_job.py 85% <0%> (-5%) ⬇️
... and 3 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 171e1bb...a521089. Read the comment docs.

@ashb
Copy link
Member

ashb commented Oct 28, 2019

Or perhaps better, Creds or Credential. Connection is not really a connection. Come to think of it, this causes a lot of confusing code (all the get_conn in the codebase).

Oooh interesting idea, Connection usually implies and active connection/open socket, but as you say, our Connection objects are just credentials and connection information. Might be a thing to rename pre 2.0, at least in the code an models, if maybe not in the UI?

@dstandish
Copy link
Contributor Author

@potiuk would you be interested in reviewing?

These GCP hooks create connections in database. Changing this makes it more compatible with this PR allowing users to pull creds from other sources.

@potiuk
Copy link
Member

potiuk commented Oct 29, 2019

Yep. Was on my list . Will do it latest tomorrow.

Copy link
Member

@potiuk potiuk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really nice simplification! Love it.

Just a comment about target package in the light of recent AIP-21 clarifications and rebase is needed.

Ans sorry for so late review.

tests/gcp/hooks/test_cloud_sql.py Show resolved Hide resolved
* add optional param `connection` to postgres and mysql hooks
* instead of storing ephemeral connection to db, just pass directly to hook
* remove obsoleted tests
@dstandish dstandish force-pushed the feature/gcp-do-not-store-ephemeral-conn-in-db branch from e6a4382 to 8f31a8c Compare November 11, 2019 21:39
@dstandish
Copy link
Contributor Author

thanks @potiuk

i have rebased

Copy link
Member

@potiuk potiuk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really like it. Sorry it took so long!

tests/gcp/hooks/test_cloud_sql.py Show resolved Hide resolved
@potiuk potiuk merged commit 776e24a into apache:master Nov 14, 2019
@dstandish dstandish deleted the feature/gcp-do-not-store-ephemeral-conn-in-db branch December 10, 2019 21:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
provider:google Google (including GCP) related issues
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants